Gathering Data for this Project
Gather each of the three pieces of data as described below in a Jupyter Notebook titled wrangle_act.ipynb:
The WeRateDogs Twitter archive. I am giving this file to you, so imagine it as a file on hand. Download this file manually by clicking the following link: twitter_archive_enhanced.csv
The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. Note: do not include your Twitter API keys, secrets, and tokens in your project submission.
import pandas as pd
import requests
import os
df_dog=pd.read_csv('twitter-archive-enhanced.csv')
df_dog.head()
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response=requests.get(url)
with open(url.split('/')[-1],mode='wb') as file:
    file.write(response.content)
df_breed=pd.read_csv('image-predictions.tsv',sep='\t')
df_breed.head()
import tweepy
accesskey=pd.read_csv('twittertoken.csv')
consumer_key = accesskey.consumer_key[0]
consumer_secret = accesskey.consumer_secret[0]
access_token = accesskey.access_token[0]
access_secret = accesskey.access_secret[0]
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth_handler=auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
# api.get_status(id_of_tweet)
import json
# try:
#     os.remove('tweet_json.txt')
# except OSError:
#     pass
if os.path.isfile('tweet_json.txt'):
    print('file already exists')
else:
    count=0
    for tweet_id in df_dog.tweet_id:
        try:
            tweet=api.get_status(tweet_id,tweet_mode='extended')
            writetweet=tweet._json
            with open('tweet_json.txt',mode='a',encoding='utf-8') as file:
                json.dump(writetweet,file)
                file.write('\n')
            count+=1
            print(count,tweet._json.get('id_str'))
        except tweepy.TweepError:
            # write an empty line so the file stays aligned with the tweet IDs
            with open('tweet_json.txt',mode='a',encoding='utf-8') as file:
                file.write('\n')
            count+=1
            print(count,'TWEET NOT FOUND!')
Probe a pretty-printed example of the JSON and compare it with the Tweet data dictionary to find important attributes.
with open('tweet_json.txt',mode='r') as file:
    tweet=json.dumps(json.loads(file.readline()),indent=4)
print(tweet)
dict keys of important features:
'id_str','favorite_count', 'retweet_count'
Features that might be interesting:
'followers_count' under user - do we need to normalize favorites/retweets by the number of current followers?
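As a sketch of that normalization idea (with made-up numbers, assuming a frame shaped like df_tweet), engagement per follower could be computed like this:

```python
import pandas as pd

# Hypothetical numbers; note that followers_count is the count at query time,
# not at tweet time, so this normalization is only a rough adjustment.
df = pd.DataFrame({'favorite_count': [8000, 400],
                   'retweet_count': [2500, 120],
                   'followers_count': [4_000_000, 4_000_000]})
df['favorites_per_1k_followers'] = df.favorite_count / df.followers_count * 1000
df['retweets_per_1k_followers'] = df.retweet_count / df.followers_count * 1000
print(df[['favorites_per_1k_followers', 'retweets_per_1k_followers']])
```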
with open('tweet_json.txt',mode='r') as file:
    tweet=json.loads(file.readline())
## twitter suggests grabbing id_str to ensure that full number is grabbed
## there are potential issues with assigned int types
tweet.get('id_str')
tweet.get('favorite_count')
tweet.get('retweet_count')
tweet.get('user').get('followers_count')
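A quick demonstration of why id_str is the safer choice (the 18-digit ID below is illustrative): IDs this large exceed the 53-bit mantissa of a double, so any float round-trip silently corrupts them, which can happen in pandas when NaNs force an integer column to float.

```python
import json

tid = 892420643555336193                # an example 18-digit tweet ID
assert int(float(tid)) != tid           # a float round-trip loses precision
payload = json.loads('{"id": 892420643555336193, "id_str": "892420643555336193"}')
assert payload['id'] == tid             # Python ints are exact here...
assert payload['id_str'] == str(tid)    # ...but id_str is safest across tools
```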
df_list=[]
with open('tweet_json.txt',mode='r') as file:
    content = file.read().splitlines()
for line in content:
    try:
        tweet=json.loads(line)
        tweet_id=tweet.get('id_str')
        favorite_count=tweet.get('favorite_count')
        retweet_count=tweet.get('retweet_count')
        created_at=tweet.get('created_at')
        followers_count=tweet.get('user').get('followers_count')
        df_list.append({'tweet_id': tweet_id,
                        'favorite_count': favorite_count,
                        'retweet_count': retweet_count,
                        'followers_count': followers_count})
    except json.JSONDecodeError:
        # skip the blank lines written for tweets that could not be retrieved
        pass
df_tweet = pd.DataFrame(df_list, columns = ['tweet_id', 'favorite_count', 'retweet_count','followers_count'])
df_tweet.head()
After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. Detect and document at least eight (8) quality issues and two (2) tidiness issues in your wrangle_act.ipynb Jupyter Notebook. To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.
df_dog.info()
df_dog[df_dog.expanded_urls.isnull()][['tweet_id','in_reply_to_status_id','retweeted_status_id']]
df_dog.describe()
df_dog.rating_denominator.value_counts()
df_dog.rating_numerator.value_counts()
df_dog[df_dog.rating_numerator==2]
df_dog[df_dog.tweet_id.duplicated()]
df_dog.source.value_counts()
df_dog[df_dog.source=='<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>']
df_dog.name.value_counts()
df_dog.doggo.value_counts()
df_dog[df_dog.doggo!='None']
df_tweet.info()
df_tweet.describe()
df_breed.info()
df_breed.describe()
df_breed
Some images are confidently classified as something other than a dog. Maybe these aren't dogs at all?
url=df_breed.jpg_url[df_breed.tweet_id==666051853826850816].iloc[0]
from PIL import Image
from io import BytesIO
r=requests.get(url)
i = Image.open(BytesIO(r.content))
i
list(df_dog.text[df_dog.tweet_id==666051853826850816])
This one is clearly not a dog (the NN did not make a classification mistake). Additionally, this type of image can explain some of the really low ratings.
df_dog quality issues:
- retweets and replies are included and should be removed
- many names were extracted incorrectly (e.g. "a") or missed
- some rating numerators and denominators were extracted incorrectly
- timestamp is stored as a string, and tweet_id types differ across tables

df_tweet quality issues:
- there are more tweets in df_dog but only 2347 in df_tweet; some tweets were deleted

df_breed quality issues:
- there are more tweets in df_dog but only 2075 classified images in df_breed

Tidiness issues:
- the three sources can be joined into one table, as all values are measured on the same unit, tweet_id
- the dog "stages" can be combined into one column with the value "doggo", "floofer", "pupper", "puppo", or "multiple"; change its data type to category
- rating_numerator and rating_denominator can be combined into a single rating value
df_dog_clean=df_dog.copy()
df_tweet_clean=df_tweet.copy()
df_breed_clean=df_breed.copy()
df_dog.pupper.value_counts()
a=df_dog.doggo=="doggo"
b=df_dog.pupper=="pupper"
len(df_dog[a&b])
there should be 257-12=245 puppers
df_dog_clean[df_dog_clean.tweet_id==817777686764523521]
df_dog.iloc[:,13:].head()
df_stage=df_dog_clean.doggo+df_dog_clean.floofer+df_dog_clean.pupper+df_dog_clean.puppo
df_stage
replace_all function from https://gomputor.wordpress.com/2008/09/27/search-replace-multiple-words-or-characters-with-python/
rep = {"NoneNoneNoneNone": "None",
"doggoNoneNoneNone": "doggo",
"NoneflooferNoneNone": "floofer",
"NoneNonepupperNone": "pupper",
"NoneNoneNonepuppo": "puppo"}
def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text
df_stage=df_stage.apply(lambda text: replace_all(text, rep))
df_stage.value_counts()
df_stage=df_stage.str.replace(r'^doggo\w+','multiple',regex=True)
df_stage.value_counts()
df_dog_clean['dog_stage']=df_stage
df_dog_clean.dog_stage.value_counts()
a=df_dog.doggo=="doggo"
b=df_dog.pupper=="pupper"
df_dog_clean[a&b].iloc[:,13:].head()
df_dog_clean[a].iloc[:,13:].sample(10,random_state=10)
df_dog_clean[b].iloc[:,13:].sample(10,random_state=10)
df_dog_clean.drop(['doggo','floofer','pupper','puppo'],axis=1,inplace=True)
df_dog_clean.dog_stage=df_dog_clean.dog_stage.astype('category')
df_dog_clean.info()
merge the three tables on tweet_id with inner join. Make sure that all types are int64 before merging
df_dog_clean.info()
df_tweet_clean.info()
df_breed_clean.info()
Convert tweet_id in df_tweet to int64 type
df_tweet_clean.tweet_id=df_tweet_clean.tweet_id.astype('int64')
df_tweet_clean.info()
df_dog_clean=df_dog_clean.merge(df_tweet_clean,on='tweet_id')
df_dog_clean=df_dog_clean.merge(df_breed_clean, on='tweet_id')
df_dog_clean.info()
remove any tweet that has a retweeted_status_id or an in_reply_to_status_id
notretweets=df_dog_clean.retweeted_status_id.isnull()
notreplies=df_dog_clean.in_reply_to_status_id.isnull()
df_dog_clean=df_dog_clean[notreplies&notretweets]
df_dog_clean.info()
drop in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp
drop_cols=['in_reply_to_status_id',
'in_reply_to_user_id',
'retweeted_status_id',
'retweeted_status_user_id',
'retweeted_status_timestamp']
df_dog_clean=df_dog_clean.drop(drop_cols,axis=1)
df_dog_clean.info()
Re-extract names, requiring a capitalized word after template phrases that seem to precede names. Probe the text of misnamed dogs to see if names were missed, or if the text simply has no name.
df_dog_clean.name.value_counts()
df_dog_clean.text[df_dog_clean.name=='a']
The base extraction seems to use "This is \w+" or "Here is \w+" or a similar template, but it has a lot of false positives. Here we see that there are plenty of times when the dog name is not immediately after "This is ". One thing the regex extraction should exploit is that dog names are always capitalized; this will get rid of many false positives.
Also, looking at the list of false positives, there are actually names in some of the text. Two more keys that may pick up names are shown in these tweets:
df_dog_clean.text[1979]
df_dog_clean.text[2002]
It makes sense that "name is" or "named" may precede the name of the dog, and they should be included in the extraction.
# note to self: (?: ) is a non-capture group, needed for or statement
names=df_dog_clean.text.str.extract('(?:[Tt]his is |named |name is |[Hh]ere is )([A-Z][\w\']+)',expand=True)
names[0].value_counts().head(10)
sum(names[0].isnull())
Next I need to probe the text of tweets that did not have a name extracted, maybe there are other key phrases that are missed. I can iterate over the names with new templates if necessary.
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean.text[names[0].isnull()])
A huge one I am missing is "Meet ". That makes sense. Also "Say hello to" pops up frequently.
names=df_dog_clean.text.str.extract('(?:[Tt]his is |[Mm]eet |hello to |named |name is |[Hh]ere is )([A-Z][\w\']+)',expand=True)
names[0].value_counts().head(10)
sum(names[0].isnull())
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean.text[names[0].isnull()])
There are indeed still some missed dogs, but it would be hard to capture them all without increasing false name extractions. Owner names, holidays, and famous people's names are capitalized, and the account invents breed names for dogs that are also capitalized. One idea would be to use the previously extracted names as a lookup table to match names in the tweets that have none extracted. However, since many dogs have human names, it would still require some manual cleanup. Here we can also see examples of how images with multiple dogs (and dog names) complicate things. Within the scope of this project, the current extraction does pretty well: there are only 591 tweets without names, and most are truly without names.
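A minimal sketch of that lookup-table idea (the names and tweet texts below are made up; in the notebook, known_names would come from the already-extracted names and texts from the tweets with no extracted name):

```python
import pandas as pd

# Hypothetical stand-ins for names[0].dropna().unique() and the nameless tweets.
known_names = ['Charlie', 'Lucy', 'Oliver']
texts = pd.Series([
    "Everyone should know Lucy loves belly rubs. 12/10",
    "Just a good pupper doing good things. 11/10",
])
# Build an alternation pattern from the known names and look for whole-word hits.
pattern = r'\b(' + '|'.join(known_names) + r')\b'
rescued = texts.str.extract(pattern, expand=False)
print(rescued)  # 'Lucy' for the first tweet, NaN for the second
```

Since many dogs share human names, any matches rescued this way would still need manual review.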
df_dog_clean['name']=names[0]
df_dog_clean.name.value_counts()
df_dog_clean[['name','text']].sample(15,random_state=10)
df_dog_clean.info()
Probe tweets with non-10 denominators to see if there is some common reason, or if the rating was not correctly extracted.
Modify rating extraction if necessary, and exclude tweets that have no ratings or strange ratings if necessary.
df_dog_clean.rating_denominator.value_counts()
list(df_dog_clean[df_dog_clean.rating_denominator!=10].text)
Tricky. Sometimes the wrong thing was extracted (24/7, 9/11, 7/11, 4/20, 50/50, 1/2), and sometimes the rating system is modified to sum the ratings across multiple dogs (45/50 = 9/10 for 5 dogs). It seems that ratings always have a denominator divisible by 10, so we can build a template that excludes common fractions whose denominator is not.
This will take care of 24/7, 9/11, 7/11, and 1/2. 4/20 and 50/50 will need to be handled carefully, as they could exist as valid ratings, especially 50/50.
The proper rating also seems to always be the last fraction in the text, which may be leveraged.
Potentially, we can additionally normalize ratings to be out of 10.
An important thing to note is that the tweet with 24/7 in it actually has no rating at all, and should be dropped.
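A quick check of the "denominator divisible by 10" idea on made-up sample texts. Taking the last match also exercises the "rating is the last fraction" observation, while 4/20 still slips through and needs manual handling:

```python
import re

# Hypothetical sample texts exercising the common false-positive fractions.
samples = ["happy 4/20 from the squad! 13/10",
           "9/11 tribute pupper 14/10",
           "24/7 good boy"]
# Require the denominator to end in 0; take the last match as the rating.
found = [re.findall(r'(\d+)/(\d+0)\b', s) for s in samples]
ratings = [m[-1] if m else None for m in found]
print(ratings)  # [('13', '10'), ('14', '10'), None]
```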
df_dog_clean[df_dog_clean.rating_denominator!=10].text
# drop the tweet with 24/7 as it has no rating
df_dog_clean=df_dog_clean.drop(index=412,axis=0)
df_dog_clean[df_dog_clean.rating_denominator!=10].text
iterate using extractall a few times, checking the levels to find examples where more than one denominator is extracted.
denominator=df_dog_clean.text.str.extractall('/(\d+0)')
denominator.xs(1,level='match').head()
df_dog_clean.loc[21].text
We need to avoid extracting numbers from the URLs.
denominator=df_dog_clean.text.str.extractall('[^o]/(\d+0)')
denominator.xs(0,level='match').head()
# denominator
manual_rating_clean=denominator.xs(1,level='match').index.tolist()
list(df_dog_clean.loc[manual_rating_clean].text)
df_dog_clean.loc[612].text
df_dog_clean.loc[964].text
Now this is tricky: we know that 4/20 is not a score, but sometimes there are two legitimate scores. I can check and manually set multiple-dog scores for these tweets after cleanup. I'll take the last numerator and denominator for all tweets, and for these specific ones I'll make sure they are correct, or add up the scores for multiple dogs (e.g. tweet index 612 will be 12/10 and 11/10 -> 23/20).
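The summing approach can be sketched on a made-up two-dog tweet (the text below is hypothetical):

```python
import re

# Hypothetical tweet with two individual scores; summing gives the combined rating.
text = "Both are good dogs. 12/10 and 11/10 would pet simultaneously"
pairs = re.findall(r'(\d+)/(\d+0)\b', text)
num = sum(int(n) for n, _ in pairs)   # 12 + 11
den = sum(int(d) for _, d in pairs)   # 10 + 10
print(f"{num}/{den}")  # 23/20
```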
denominator=denominator.reset_index()
denominator.loc[1648:1653]
denominator=denominator[~denominator.level_0.duplicated(keep='last')]
denominator=denominator.set_index('level_0',verify_integrity=True)
denominator.loc[1728:1731]
denominator[0]=denominator[0].astype('int64')
denominator.drop('match',axis=1,inplace=True)
denominator[0].value_counts()
df_dog_clean['rating_denominator']=denominator
df_dog_clean.info()
testset=[731, 873, 921, 964, 1402, 2049]
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean[['text','rating_denominator']].loc[testset])
Probe tweets with unusual rating numerators to see if there is some common reason, or if the rating was not correctly extracted.
df_dog_clean.rating_numerator
list(df_dog_clean[df_dog_clean.rating_numerator==144].text)
list(df_dog_clean[df_dog_clean.rating_numerator.isin([0,1,204,420,44,88,1776])].text)
A sampling of some numbers that stood out. Most of them look legitimate. Some correspond to multiple dogs and make sense, some are referential numbers (e.g. 1776 for an "America af" dog), some are low because the subjects are not dogs, and some are incorrectly extracted like the denominators above. A small modification of the denominator extraction should work.
numerator=df_dog_clean.text.str.extractall('(\d+)/\d+0')
num_list=numerator.xs(1,level='match').index.tolist()
list(df_dog_clean.loc[num_list].text)
A lot of these look familiar; compare with the list that needs to be manually fixed.
num_list=numerator.xs(1,level='match').index.tolist()
list(set(num_list)-set(manual_rating_clean))
This means that the multi-extracted numerators here come from the same tweets that had multiple denominator extractions; these need to be fixed manually.
An important thing to check is decimals: are all ratings integers, and if not, are they captured well?
Here is an example with a decimal in the numerator that is extracted incorrectly.
list(df_dog_clean[df_dog_clean.tweet_id==883482846933004288].text)
df_dog_clean[df_dog_clean.tweet_id==883482846933004288].rating_numerator
numerator=df_dog_clean.text.str.extractall('(\d+\.?\d*)/\d+0')
num_list=numerator.xs(1,level='match').index.tolist()
# list(df_dog_clean.loc[num_list].text)
list(set(num_list)-set(manual_rating_clean))
numerator=numerator.reset_index()
numerator=numerator[~numerator.level_0.duplicated(keep='last')]
numerator=numerator.set_index('level_0',verify_integrity=True)
numerator[0]=numerator[0].astype('float64') #keep decimals
numerator.drop('match',axis=1,inplace=True)
numerator[0].value_counts()
numerator.info()
df_dog_clean['rating_numerator']=numerator
df_dog_clean.info()
with pd.option_context('display.max_colwidth',-1):
    print(df_dog_clean[['text','rating_numerator']].loc[testset]) #same testset as denominator
df_dog_clean[df_dog_clean.tweet_id==883482846933004288][['text','rating_numerator']]
with pd.option_context('display.max_colwidth',-1):
    for tweet in manual_rating_clean:
        print(df_dog_clean[['text','rating_numerator','rating_denominator']].loc[tweet])
# print(df_dog_clean[['text','rating_numerator','rating_denominator']].loc[manual_rating_clean])
numerator=[23,17,13,11,18,11,13,15,15,10,21,21,17,14,13,8,10,19,19,17,9,10,15,20]
denominator=[20,20,10,10,20,20,10,20,20,10,20,20,20,20,20,10,10,20,20,20,20,10,20,20]
manualfix=pd.DataFrame(index=manual_rating_clean,data={'numerator':numerator,'denominator':denominator})
manualfix.numerator
df_dog_clean.rating_denominator.update(manualfix.denominator)
df_dog_clean.rating_numerator.update(manualfix.numerator)
df_dog_clean[['rating_numerator','rating_denominator']].loc[manual_rating_clean]
rating_numerator and rating_denominator can be combined into a single rating value.
Create a variable rating by dividing rating_numerator by rating_denominator, and a variable num_dogs by dividing rating_denominator by 10.
df_dog_clean['rating']=df_dog_clean.rating_numerator/df_dog_clean.rating_denominator
df_dog_clean['num_dogs']=df_dog_clean.rating_denominator/10
df_dog_clean[['rating','num_dogs']].sample(10,random_state=15)
df_dog_clean[['rating_numerator','rating_denominator','rating']].sample(10,random_state=150)
df_dog_clean[['rating_numerator','rating_denominator','rating']].loc[manual_rating_clean]
df_dog_clean.drop(['rating_numerator','rating_denominator'],axis=1,inplace=True)
df_dog_clean.info()
Inspect a sample of images that are not classified as dogs, and determine if they should be kept or not.
p1=df_dog_clean.p1_dog==False
p2=df_dog_clean.p2_dog==False
p3=df_dog_clean.p3_dog==False
sum(p1&p2&p3)
df_dog_clean[p1&p2&p3][['text','jpg_url','p1','p2','p3']].head(8)
from IPython.display import display

i=[]
for x in range(8):
    url=df_dog_clean.jpg_url[p1&p2&p3].iloc[x]
    r=requests.get(url)
    i.append(Image.open(BytesIO(r.content)))
for img in i:
    display(img)
df_dog_clean.info()
These samples seem to show a big challenge with NN image classification: the neural network cannot choose what part of the image to "attend" to, and if the dog is not the largest object in the image or blends in, it has issues "focusing" on the dog to make a classification. Not to mention some of the images are humorously not dogs at all.
I don't think this has to be cleaned, but if we want to do an analysis with dog breeds, only highly confident dog images should be used.
convert timestamp type to datetime, tweet_id to string.
df_dog_clean.timestamp=pd.to_datetime(df_dog_clean.timestamp)
df_dog_clean.tweet_id=df_dog_clean.tweet_id.astype('str')
df_dog_clean.info()
df_dog_clean.sample(10,random_state=190)
df_dog_clean.to_csv('twitter_archive_master.csv',index=False)
Here are some insights into the data for the act_report.html.
Dog breed retweets and favorites
df_dog_clean.p1.value_counts().head(15)
set some filters for classification confidence, and for the number of dogs in the image.
confidence=df_dog_clean.p1_conf>.6
singledog=df_dog_clean.num_dogs==1
df_dog_clean[(df_dog_clean.p1=='Pomeranian')
&confidence&singledog][['favorite_count','retweet_count']].mean()
df_dog_clean[(df_dog_clean.p1=='pug')
&confidence&singledog][['favorite_count','retweet_count']].mean()
df_dog_clean[(df_dog_clean.p1=='Labrador_retriever')
&confidence&singledog][['favorite_count','retweet_count']].mean()
df_dog_clean[(df_dog_clean.p1=='golden_retriever')
&confidence&singledog][['favorite_count','retweet_count']].mean()
breed_contrast=df_dog_clean[(df_dog_clean.p1.isin(['golden_retriever',
'Labrador_retriever',
'Pomeranian','pug']))&confidence&singledog]
breed_contrast.head()
breed_contrast=breed_contrast.replace({'p1':{'golden_retriever': 'Golden Retriever','Labrador_retriever':'Labrador','pug':'Pug'}})
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
g=sns.boxplot(data=breed_contrast[breed_contrast.retweet_count<breed_contrast.retweet_count.quantile(.99)],
y='retweet_count',x='p1',
order=['Labrador','Golden Retriever','Pomeranian','Pug'])
# g.set_yscale('log')
plt.ylabel('Number of retweets',fontsize=15)
plt.xlabel('Breed',fontsize=15)
plt.title('Retweets vs. Dog Breed',fontsize=20)
plt.plot()
plt.tight_layout()
plt.savefig('retweetvsdog.png')
plt.show()
df_dog_clean.groupby('dog_stage')[['favorite_count','retweet_count']].median()
df_dog_clean.groupby('dog_stage').tweet_id.count()
df_dog_clean.groupby(df_dog_clean.name.notnull())[['favorite_count','retweet_count']].median()
g=sns.lmplot(data=df_dog_clean,x='favorite_count',y='retweet_count')
g.set(xticks=range(0,150001,50000))
plt.plot([-10000,140000],[-10000,140000],'k--')
plt.ylabel('Number of Retweets',fontsize=15)
plt.xlabel('Number of Favorites',fontsize=15)
plt.title("Retweets vs. Favorites",fontsize=20)
plt.tight_layout()
plt.savefig('retweetvsfavorites.png')
plt.show()
import numpy as np
np.corrcoef(df_dog_clean.retweet_count,df_dog_clean.favorite_count)[0,1]
df_dog_clean[df_dog_clean.rating<df_dog_clean.rating.quantile(.99)].rating.median()
g=sns.distplot(df_dog_clean[df_dog_clean.rating<df_dog_clean.rating.quantile(.99)].rating*10,
kde=False)
plt.plot(11.1,200,'r*',ms=15)
plt.ylabel('Count of ratings',fontsize=15)
plt.xlabel('Dog Rating',fontsize=15)
plt.title("Distribution of Ratings",fontsize=20)
plt.tight_layout()
plt.savefig('distofrating.png')
plt.show()